Computational Biology and Chemistry — Latest Matching Preprints

1

Glycine molecule radical: Predicted properties and dipeptide formation

Synak, J.; Blazewicz, J.

2026-07-10 bioinformatics 10.64898/2026.07.07.736934 medRxiv

Top 0.1%

4.1%

Numerous advances in quantum and computational chemistry over the last decades, as well as the development of computer science, have allowed the use of more precise and complex models, which can now be applied to much bigger systems than in the past. The authors used Gaussian, coupled with theoretical methods, to predict a new route to peptide bond formation that could have taken place in prebiotic conditions. To better tackle this difficult task, the properties of the substrates (glycine-derived radicals) were extensively analysed using Gaussian, taking resonance and hybridisation into account, to better understand the stereochemistry and the nature of the processes taking place. The result is a series of reactions which, without any sophisticated catalysts and with relatively low energy thresholds ({inverted exclamation}20 kcal/mol), can lead to the formation of dipeptides (and, further, oligopeptides). The authors also hope that the other predicted properties of the investigated molecules can be of use to any researcher who would like to utilise them in their experiments. Author summary: Our goal was to investigate a way in which the first peptide bonds could have been formed in prebiotic conditions. This is an extremely important step in research into the beginning of life on Earth. We found a very promising series of reactions, which uses atomic hydrogen as its only catalyst, and confirmed our expectations with theoretical calculations using Gaussian. Two radicals derived from glycine perform major roles in the process, so we investigated their properties with Gaussian and verified that the results agree with our own theoretical considerations. This involved checking for possible geometric isomers and conformers and creating models which could explain their properties.
We are well aware that such calculations have limitations and that no model is 100% accurate, so our results should be further confirmed by empirical data in the future. However, we still tried to be as thorough as possible in how we approached the subject.

2

Super Learner Ensemble Modeling of CPTAC Proteomic Data for Survival Prediction in Head and Neck Squamous Cell Carcinoma

Park, E.; Lee, H.; Oh, E. J.; Tham, T.; Ahn, S.

2026-06-16 bioinformatics 10.64898/2026.06.11.731237 medRxiv

Top 0.1%

3.4%

Survival analysis in head and neck squamous cell carcinoma (HNSCC) is traditionally performed using Cox proportional hazards models, alongside some exploration into black-box machine learning methods. The Super Learner (SL) algorithm addresses this model selection dilemma by combining diverse candidate algorithms into a weighted ensemble to perform comparably to the best candidate method. This study evaluates the performance of SL in HNSCC. Proteomic features as well as clinical covariates from 96 CPTAC HNSCC samples were modeled with three candidate algorithms (Cox LASSO, Cox Ridge, and Random Survival Forest) as well as the ensemble SL method. Models were optimized via Uno's time-dependent Concordance Index (C-index) and tested at 1- and 3-year time horizons using 2000 bootstrap resamples. The Cox Ridge regression model achieved the highest predictive accuracy among the four total methods. However, the SL demonstrated stable performance over both time horizons (1-year C-index: 0.985; 3-year C-index: 0.960). Variable importance analysis of the Cox Ridge model successfully identified malignant proteins (ATR, MAML1, MIEN1) alongside novel potential prognostic indicators (ZNF800, KERA). This analysis emphasizes the statistical necessity for larger cohorts for ensemble learning, while providing a benchmark of proteomic indicators in HNSCC.

3

Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv

Top 0.1%

3.2%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem. Author Summary: Protein-protein interactions are fundamental to all biological processes.
However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model that automatically learned features of the primary sequence, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
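
As a rough illustration of the "amino acid frequency" half of the hybrid feature set described above, the sketch below computes a 20-dimensional frequency vector per sequence and concatenates it with an optional latent vector for each protein in a pair. The function names, the choice to ignore non-standard residues, and the concatenation order are assumptions made for illustration; the preprint's abstract does not specify them.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid.

    Non-standard characters (e.g. X, B, Z) are ignored; this is an
    illustrative choice, not something stated in the abstract.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq_a, seq_b, latent_a=(), latent_b=()):
    """Hypothetical hybrid feature vector for a protein pair.

    Concatenates frequency features with caller-supplied latent vectors
    (for example, autoencoder embeddings) for each protein.
    """
    return (aa_frequencies(seq_a) + list(latent_a)
            + aa_frequencies(seq_b) + list(latent_b))
```

A vector built this way could be passed to any standard classifier; with empty latent vectors it reduces to the frequency-only baseline that the abstract compares against.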

4

Graph Neural Networks (GNNs) for Protein-Ligand Interaction Prediction

Khilar, S.; Natarajan, E.

2026-04-24 bioinformatics 10.64898/2026.04.23.720519 medRxiv

Top 0.1%

3.2%

Protein-ligand interaction prediction in modern drug discovery has evolved through the combination of artificial intelligence and structural bioinformatics, particularly Graph Neural Networks (GNNs). GNN models have achieved a high degree of accuracy in predicting binding affinity and identifying active compounds, as evidenced by [1] [2] [3] [4], but their limited explainability remains an important obstacle in biomedical research. This work focuses on the interpretation of protein-ligand interactions at the molecular level, a rapidly developing area of GNN research. Current studies use visualization techniques, attention mechanisms, and model-based feature attribution to make models more robust and to reduce false binding predictions. In addition, graph pooling strategies, message-passing optimization, self-supervised learning, transfer learning, and contrastive learning are increasingly used to enhance representation learning. Furthermore, integrating molecular docking simulations, hybrid deep learning architectures, and protein language models gives more reliable and biologically meaningful predictions of protein-ligand interactions. These approaches identify key ligand atoms and binding residues, as well as the physicochemical factors influencing affinity, in chemically interpretable terms. This work identifies the challenges of developing biologically significant explanations, achieving transparency, and accounting for the effect of dataset biases on interpretability. It also examines in depth the integration of protein language models to establish more reliable pathways for future research, covering hybrid architectures, transparent and energy-efficient GNNs, and scientifically grounded AI models for drug discovery.
This work highlights that explainable GNNs (XGNNs) connect deep learning with biochemical expertise with increased confidence, which will enhance the accuracy of predictive and computational models.

5

Integrated Analysis of HeberFERON-Driven Comparative Proteomic regulation in Glioblastoma Cells U-87MG

Vazquez-Blomquist, D.; Besada, V.; Miranda, J.; Ramos, Y.; Palomares, C. S.; Guirola, O.; Bringas, R.; Vonasek, E.; Gil, Y.; Perez, W.; Diaz, T.; Quinones-Vega, M.; Gonzalez, L. J.; Bello-Rivero, I.

2026-04-24 cancer biology 10.64898/2026.04.22.720155 medRxiv

Top 0.4%

1.8%

Glioblastoma is a very aggressive brain tumor with few therapeutic options. HeberFERON, a co-formulation of type I and II interferons (IFNs), has been used in cancer treatment, with promising results in high-grade brain tumors. High-throughput techniques in easy-to-handle models have been important to interrogate biomolecular changes, describe mechanisms, and find pharmacodynamic biomarkers. This study aims to elucidate the effect of HeberFERON on the cell proteome in comparison to its individual IFN components. Proteomic changes induced by HeberFERON in the glioblastoma-derived cell line U-87MG, in comparison with individual IFN-α2b and IFN-γ, were studied using an EasyLC nanoLC instrument coupled to a Velos Pro mass spectrometer; MaxQuant and Perseus were also used. Several enrichment tools, network analysis, and canSAR for drug targets were employed. Translation, RNA processing, mitotic cell cycle, cytoskeleton and chromosome organization, apoptosis, autophagy, and DNA repair are enriched, acting to limit cellular growth, together with changes in immune response components, supporting HeberFERON as a multitarget treatment. The co-formulation is distinguished by its modulation of RNA splicing with the SMN complex, cytoskeleton organization and microtubule-based movement, nuclear envelope breakdown, DNA conformational changes, and oxidative phosphorylation, giving a clearer picture of its effects over a variety of systems inside the tumor cell. Together with a previous microarray experiment, informative genes and proteins emerged as pharmacodynamic biomarkers for antiproliferative effects (e.g., STAT1/2, CENPE, ATRIP, MAP1B, LIMA1, VCP, and several ribosomal, spliceosomal, and proteasomal complex proteins). This study complements previous transcriptomic and phosphoproteomic experiments in this model and underscores HeberFERON as a glioblastoma therapeutic.

6

Increased Ovarian and Colorectal Cancer Cell Sensitivity to Platinum Drugs and innovative TS Inhibitors by Electroporation

Marverti, G.; Belardo, A.; Mercanile, G.; Aiello, D.; Venturelli, A.; Costi, M. P.; D'Arca, D.

2026-05-26 cancer biology 10.64898/2026.05.21.726835 medRxiv

Top 0.4%

1.8%

Ovarian and colorectal cancers have the highest incidence and mortality in the world, after breast cancer. Despite the initial response to Pt-drugs or 5-fluorouracil (5-FU), many cancer cells develop resistance to these drugs. For this reason, new therapeutic strategies represent an important medical need, in particular for drugs that are being studied in combination with methods that promote their entry into the cell. Among these strategies, electrochemotherapy (ECT), the combination of drugs with electroporation (EP), a physical method that uses high-frequency electrical pulses to create pores into which chemotherapy drugs can permeate, is gaining interest. In this study, we have evaluated the effect of ECT on the growth of both ovarian (A2780 and A2780/CP) and colorectal (HCT116) cancer cell lines using platinum derivatives (Cisplatin, Carboplatin and Oxaliplatin), as DNA alkylating agents, and human thymidylate synthase (hTS) inhibitors, both traditional (5FU) and novel TS destabilizers (compounds E3 and E7). To this aim, synergism quotient-like analysis to determine whether electroporation gives an advantage in terms of cytotoxicity was applied to the relative IC20 and IC50 concentrations of each drug. Results showed that two of the three Pt-drugs have greater efficacy when combined with EP. 5-FU and the new TS inhibitors E3 and E7 also take advantage of ECT because EP increases drug uptake into the cell, even in resistant cells. In conclusion, ECT appears to be a viable strategy to obviate the problem of resistance in ovarian and colorectal cancers, to deliver compounds inside cells overcoming uptake limits, especially for the low lipophilic compounds whose cytotoxic efficacy is hampered by the obstacle of biological membranes.

7

Efficacy evaluation of glasdegib Sonic Hedgehog pathway inhibition with or without inotuzumab in B-ALL cells using a new co-culturing system model and a validated chemosensitivity assay

Woolston, D. W.; Churchill, M.; Grandori, C.; Advani, A.; Yeung, C. C. S.

2026-05-12 cancer biology 10.64898/2026.05.07.723573 medRxiv

Top 0.4%

1.6%

Purpose: Glasdegib is a Sonic Hedgehog (SHH) pathway inhibitor used for treating newly diagnosed acute myeloid leukemia in elderly patients or patients unfit for intensive chemotherapy. This study sought to demonstrate growth inhibition and increased apoptosis of B-cell acute lymphoblastic leukemia (B-ALL) in vitro under glasdegib, alone and combined with inotuzumab, using a novel co-culture system and a validated chemosensitivity testing model, to determine whether glasdegib with and without inotuzumab may represent a promising treatment strategy in B-ALL. Methods: Seven blood and marrow samples from B-ALL patients were co-cultured with HS-5 stromal cells in a co-culturing system designed to mimic the tumor microenvironment and maintain B-ALL cell viability for chemosensitivity testing under glasdegib and inotuzumab. Results: Co-culturing improved B-ALL viability from four to nine days. Dose-dependent responses to glasdegib were consistent among B-ALL samples on day four based on culture viability, and varied with expression of the SHH genes GLI1, GLI3, SMO, and PTCH1. Combination with inotuzumab had varied effects on treatment response. Conclusion: Co-culturing B-ALL cells with HS-5 stromal cells improves B-ALL growth and viability. Glasdegib treatment, with and without inotuzumab, affects the viability of co-cultured B-ALL cells by day four. SHH gene expression suggests that different B-ALL patients may be sensitive or resistant to glasdegib and inotuzumab.

8

WITHDRAWN: Integrative Transcriptomic Analysis Identifies Hypoxia-Responsive Cell Cycle Hub Genes as Prognostic Markers in Glioblastoma

Sharma, M. K.; Chongtham, J.; Bhushan, A.; Chosdol, K.; Sinha, S.; Srivastava, T.

2026-05-12 cancer biology 10.1101/2025.10.18.683218 medRxiv

Top 0.5%

1.5%

Glioblastoma (GBM) is the most aggressive primary brain malignancy, characterized by hypoxia-driven proliferation, therapeutic resistance, and poor prognosis. While hypoxia-induced transcriptional changes are well documented, the temporal regulation of cell cycle genes under sustained hypoxia remains unclear. This study profiled transcriptomic alterations in U87MG cells cultured under normoxia and graded hypoxia for one to three days. Differentially expressed genes (DEGs) were identified and analyzed using STRING, Cytoscape, MCODE, and CytoHubba to construct protein-protein interaction (PPI) networks and extract hub genes. Functional enrichment was assessed through DAVID, ClueGO, and KEGG, while prognostic relevance was evaluated using GlioVis and ONCOMINE datasets. qRT-PCR validated expression of selected hub genes. A total of 294 DEGs were identified, forming two main functional modules enriched in cell cycle regulation and chemokine signaling pathways. Eighteen hub genes (KIF20A, CCNB1, AURKA, EGR1, CDCA3, CENPF, CDCA2, ASPM, KIF11, CCL2, CCNA2, DLGAP5, RACGAP1, TPX2, PTGS2, CTGF, and KIFC1) were significantly associated with mitotic processes and GBM progression. Survival analysis demonstrated that 17 of these genes correlated with poor overall survival (p < 0.05). qRT-PCR confirmed that hub gene expression peaked during early hypoxia and declined with prolonged exposure, indicating dynamic regulatory adaptation. These findings identify key hypoxia-responsive genes governing cell cycle progression and highlight their prognostic and therapeutic potential in glioblastoma.

9

Learning from Drops: AI-Guided Integration of Liquid Biopsy Features in Cancer Studies

Andueza, M.; Villoslada-Blanco, P.; De Dreuille, B.; Alonso, L.; Sabroso-Lasa, S.; Pantel, K.; Alix-Panabieres, C.; Lopez de Maturana, E.; Malats, N.

2026-05-17 bioinformatics 10.64898/2026.05.12.724535 medRxiv

Top 0.6%

1.2%

Cancer is a major global health issue with rising incidence and mortality. Early detection, tumor characterization, and disease surveillance are crucial for timely and effective treatment, ultimately reducing mortality rates. Liquid biopsy (LB) has emerged as a valuable detection tool offering a non-invasive method to determine tumor-derived biomarkers in body fluids with demonstrated translational potential. To increase biomarker sensitivity, high-throughput sequencing platforms deliver massive volumes of data. Artificial Intelligence (AI) is pivotal in enabling huge and complex data integration. This contribution aims to assess the current state of integrative AI-based research in the LB field and provide methodological guidance. First, we conducted a PubMed search and found that the literature is sparse in studies integrating LB features, particularly by applying AI. When adopting the latter approach, defining the study objectives is crucial to guide the subsequent methodological aspects, including study design, patient selection criteria, sample size, nature of the LB features, and metadata to collect. Specifically, we propose strategies and tools for data preprocessing, including normalization and batch correction, as well as handling outliers and missing data. Furthermore, we recommend various Machine/Deep Learning approaches for feature selection techniques to ensure model robustness, and we highlight the importance of undergoing rigorous internal and external validations of the selected models. Assessing clinical utility and interpretability is often overlooked but fundamental for real-world implementation. In conclusion, we provide the LB scientific community with AI-based methodological guidance to bridge the two fields and enhance the integrative analysis of LB features. Graphical abstract: Workchart for multiomics integrative studies in the liquid biopsy field.
Note: CTCs, circulating tumor cells; ctDNA, circulating tumor-DNA; TEPs, tumor-educated platelets; miRNA, microRNA; cfRNAs, cell-free RNAs.

10

Multiple Fault Analysis and Drug Therapy on Signaling Pathways Using Dynamic Bayesian Network-based Model

Chowdhury, T.; Maitra, A.; Agarwal, A.; Sur, A.; Sarkar, S.; Majumder, S.; Lodh, E.

2026-06-15 bioinformatics 10.64898/2026.06.11.731601 medRxiv

Top 0.6%

1.2%

Cancer-associated signaling pathways often exhibit abnormal activation under simultaneous dysregulation of multiple molecular components. This study presents a probabilistic temporal Dynamic Bayesian Network (DBN)-based framework for analyzing multi-fault behaviour and intervention response in Growth Factor (GF) and Mitogen-Activated Protein Kinase (MAPK) signaling pathways. Unlike deterministic Boolean propagation, the proposed model represents each pathway component through an activation probability and propagates these probabilities over discrete time steps using soft-logic update rules. One-, two-, three-, and four-fault scenarios were systematically evaluated under a common lowest-burden input vector. The resulting output probabilities were summarized using an encoded pathway-burden score, and known-drug combinations were ranked using efficiency scores relative to no-intervention baselines. Pareto analysis was further used to balance intervention efficiency against drug-vector burden, while a custom dual-target search was performed to identify computational intervention hypotheses beyond predefined drug targets. Results showed that encoded burden increased with fault order in both pathways, with MAPK producing a higher baseline burden than GF. Among known-drug vectors, U0126+LY294002+Temsirolimus consistently emerged as the strongest low-burden candidate, achieving efficiency close to the maximum six-drug vector. Custom dual-target analysis identified ERK1/2+RPS6KB1 in GF and Raf+MEK1 in MAPK as high-impact computational target pairs. Runtime benchmarking showed that batched vectorized NumPy execution substantially improved scalability for higher-order fault simulations. Overall, the framework provides an interpretable and scalable approach for probabilistic pathway-level fault analysis and intervention prioritization.
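
To make the idea of propagating activation probabilities with soft-logic rules more concrete, here is a minimal sketch on a made-up three-node pathway. Everything specific in it is an assumption: the node names (`ligand`, `receptor`, `phosphatase`, `kinase`, `bypass`, `output`), the wiring, and the use of product and probabilistic-sum operators for soft AND and OR are one common convention and are not taken from the preprint. Faults are modelled as stuck-at values that clamp a node at every time step.

```python
def soft_and(*probs):
    """Soft AND: product of activation probabilities."""
    out = 1.0
    for p in probs:
        out *= p
    return out


def soft_or(*probs):
    """Soft OR: probabilistic sum, 1 - prod(1 - p)."""
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out


def soft_not(p):
    """Soft NOT: complement of the activation probability."""
    return 1.0 - p


INPUT_NODES = ("ligand", "phosphatase", "bypass")  # hypothetical inputs


def step(state, faults):
    """One synchronous update of the hypothetical pathway."""
    nxt = {name: state[name] for name in INPUT_NODES}  # inputs persist
    nxt["receptor"] = state["ligand"]
    nxt["kinase"] = soft_and(state["receptor"], soft_not(state["phosphatase"]))
    nxt["output"] = soft_or(state["kinase"], state["bypass"])
    nxt.update(faults)  # stuck-at faults clamp nodes after the update
    return nxt


def simulate(initial, faults=None, steps=5):
    """Propagate probabilities over discrete time steps under given faults."""
    faults = faults or {}
    state = dict(initial)
    state.update(faults)
    for _ in range(steps):
        state = step(state, faults)
    return state
```

Comparing `simulate(initial)["output"]` with `simulate(initial, faults={...})["output"]` for one-, two-, or higher-order fault sets gives the kind of fault-versus-baseline comparison the abstract describes, although the actual burden score and drug-efficiency metrics are defined in the preprint, not here.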

11

Comparative Analysis of Relative Ligand Binding Free Energy Simulation Methods: Amber-TI, GROMACS-NETI, OpenMM-FEP, and BLaDE-MSLD

Lee, H.; Kim, I.; Kim, S.; Bae, M.; Jeong, B.; Kim, S.; Jo, S.; Lee, J.; Im, W.

2026-04-24 biophysics 10.64898/2026.04.22.720125 medRxiv

Top 0.7%

1.1%

Structure-based drug design has become increasingly important in the pharmaceutical industry for accelerating the discovery of effective drug candidates. In particular, ligand binding free energy serves as a critical metric for predicting drug efficacy during the key stages of hit discovery and lead optimization. Continuous progress has been made in the prediction of ligand binding free energies, but direct comparisons of different methods using the same force field remain challenging due to their unique implementations in different simulation engines. In this study, we present a direct comparison of four popular methodologies (Amber-TI, GROMACS-NETI, OpenMM-FEP, and BLaDE-MSLD) for calculating relative binding free energies (ΔΔG_bind) with the same Amber protein and ligand force fields using MolCube Alchemical Free Energy Simulator (MolCube-AFES), which provides an input generation workflow to support ΔΔG_bind calculations for all four methods. We used 80 alchemical transformations (from the JACS benchmark set by Wang et al.) and two additional applications to compare the predicted ΔΔG_bind from the four methods against experimental measurements. All four methods reproduced experimentally observed trends, with most transformations within ±2 kcal/mol of experiment, and showed broadly comparable accuracy with no statistically significant performance differences across the benchmark dataset. These results demonstrate that MolCube-AFES enables controlled, cross-platform benchmarking and show that all four alchemical free energy methods deliver statistically equivalent accuracy, with method selection guided by workflow requirements such as throughput, portability, and perturbation network design rather than expected differences in performance.
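
For readers less familiar with the quantity being benchmarked, the relation below is the standard thermodynamic-cycle definition of a relative binding free energy for transforming ligand A into ligand B. It is textbook background rather than something taken from the preprint, and the subscripts "complex" and "solvent" are simply labels for the two alchemical legs (ligand bound to the protein, and ligand free in solution).

```latex
\Delta\Delta G_{\mathrm{bind}}(A \to B)
  = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
  = \Delta G_{\mathrm{complex}}(A \to B) - \Delta G_{\mathrm{solvent}}(A \to B)
```

Methods such as TI, FEP, NETI, and MSLD differ in how each alchemical leg is estimated, which is why a comparison under a shared force field isolates the effect of the estimator and engine.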

12

A Universal Immune Index (II): A Composite Quantitative Assessment Method and Calculation Tool for Immune Function Based on Multidimensional Routine Laboratory Parameters

Zhang, Y.; Li, K.

2026-06-25 allergy and immunology 10.64898/2026.06.22.26356269 medRxiv

Top 0.7%

1.1%

Background: Quantitative assessment of immune function is essential for clinical and health decisions in oncology, post-surgical management, and autoimmune diseases. Existing methods are either too simplistic (single indicators) or too complex and costly for routine use. A standardized, easy-to-operate tool based on routine laboratory parameters is needed for both clinical and health checkup settings. Methods: We propose the Immune Index (II), integrating 9 routine laboratory parameters across three dimensions: humoral immunity (IgG, complement C3, C4), cellular immunity (CD4+ T cells, CD8+ T cells, CD4+/CD8+ ratio), and inflammatory response (CRP, IL-6, systemic immune-inflammation index [SII]). Indicators were normalized using min-max normalization to a 0-100 scale and aggregated with fixed weights (humoral 30%, cellular 40%, inflammatory 30%). The II score ranges from 0 to 100, with a healthy reference range of 50-80. Results: A four-tier grading system was established: ≥80 (immune overactivation), 50-80 (immune homeostasis), 35-50 (mild immune suppression), <35 (severe immune deficiency). Validation using 209 cases from published literature showed an AUC of 0.924 (95% CI: 0.87-0.97) for distinguishing normal from abnormal immune status, with an optimal cutoff of 47.8 (sensitivity 84.8%, specificity 85.9%). II scores were 56.7 ± 8.6 (healthy), 43.5 ± 8.0 (immunodeficient), and 33.6 ± 6.5 (autoimmune), with P < 0.001 between all groups. The calculation requires only two steps and can be implemented in Excel or LIS. II can serve as an immune dimension supplement for personal health checkups. Conclusion: The Immune Index provides a simple, standardized, and low-cost tool for quantitative immune function assessment. The fixed-weight design ensures cross-institutional comparability, making it suitable for outpatient clinics, health checkup centers, and primary care settings.
Keywords: Immune index; immune function; quantitative assessment; routine laboratory parameters; composite score; min-max normalization
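
The two-step calculation described in this abstract (min-max normalization, then fixed-weight aggregation) can be sketched roughly as follows. Only the dimension weights and the four tier labels come from the abstract. The normalization bounds are left to the caller because the abstract does not give them, and the within-dimension averaging, the `invert` option for markers where higher values are worse, and the handling of scores exactly at 35, 50, or 80 are all assumptions made for illustration.

```python
from statistics import mean

# Dimension weights as stated in the abstract.
WEIGHTS = {"humoral": 0.30, "cellular": 0.40, "inflammatory": 0.30}


def min_max_scale(value, lower, upper, invert=False):
    """Min-max normalise `value` onto a 0-100 scale.

    `lower` and `upper` are caller-supplied bounds (not given in the
    abstract). Out-of-range values are clipped; `invert=True` flips the
    scale for markers where higher raw values should score lower.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    frac = (value - lower) / (upper - lower)
    frac = min(1.0, max(0.0, frac))
    if invert:
        frac = 1.0 - frac
    return 100.0 * frac


def immune_index(scores_by_dimension):
    """Weighted sum of per-dimension mean scores (the mean is an assumption)."""
    return sum(WEIGHTS[dim] * mean(scores_by_dimension[dim]) for dim in WEIGHTS)


def grade(score):
    """Map a 0-100 score to the four tiers; boundary handling is assumed."""
    if score >= 80:
        return "immune overactivation"
    if score >= 50:
        return "immune homeostasis"
    if score >= 35:
        return "mild immune suppression"
    return "severe immune deficiency"
```

For example, already-normalized dimension scores of 60, 50, and 70 (humoral, cellular, inflammatory) give 0.3 × 60 + 0.4 × 50 + 0.3 × 70 = 59, which falls in the homeostasis tier under these assumptions.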

13

The Gompertz curve for estimating growth rates of Protein Data Bank and protein folds

Sato, K.; Tomii, K.

2026-06-26 bioinformatics 10.64898/2026.06.24.732253 medRxiv

Top 0.7%

1.1%

The Protein Data Bank (PDB) is an ever-growing, open-access repository of structural data of biological molecules. This international database has been instrumental in the development of artificial intelligence and deep learning models for protein structure prediction and design. The PDB growth is a crucially important factor influencing further development of these models. Therefore, after analyzing the growth trend in PDB depositions since the archive's launch, we found that it is well fitted by the Gompertz function, a growth curve used across various disciplines. Furthermore, we observed that the function captures the "discovery of novel folds", i.e., the cumulative number of distinct folds among protein domains that constitute most of the PDB. Consequently, based on the fitting results, we estimated the likely numbers of PDB entries and protein folds. These findings provide insights into deceleration of growth in recent years and enable us to assess anticipated trends.
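
For reference, one common parametrisation of the Gompertz function is N(t) = K · exp(−b · exp(−c · t)), where K is the asymptotic ceiling. The sketch below, which is not the authors' fitting procedure, shows a simple way to estimate b and c when K is assumed known: taking logarithms twice gives ln(ln(K / N)) = ln(b) − c · t, which is linear in t and can be fitted by ordinary least squares. The function names and the fixed-K simplification are assumptions for illustration; in practice K is usually fitted jointly by nonlinear least squares.

```python
import math


def gompertz(t, K, b, c):
    """Gompertz curve N(t) = K * exp(-b * exp(-c * t))."""
    return K * math.exp(-b * math.exp(-c * t))


def fit_gompertz_given_K(ts, ys, K):
    """Estimate (b, c) for a known ceiling K via the linearised form.

    Uses z = ln(ln(K / y)) = ln(b) - c * t and ordinary least squares.
    Every y must satisfy 0 < y < K.
    """
    zs = [math.log(math.log(K / y)) for y in ys]
    n = len(ts)
    t_mean = sum(ts) / n
    z_mean = sum(zs) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxz = sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
    slope = sxz / sxx
    intercept = z_mean - slope * t_mean
    return math.exp(intercept), -slope
```

With the fitted parameters, `gompertz(t, K, b, c)` can be evaluated at future values of t to project cumulative counts, which is the kind of estimate the abstract refers to for PDB entries and protein folds.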

14

Mechanistic Interpretability for Protein Language Models: A Validation Framework

Chon, P.; Andreopoulos, W. B.

2026-06-02 bioinformatics 10.64898/2026.05.29.727021 medRxiv

Top 0.7%

1.1%

Protein language models (PLMs) have been shown to be powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not been combined on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M parameters) that discriminate between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods identify distinct features for Class A β-lactamase, but the cross-layer transcoder shows that the concepts for Class A β-lactamase seem to be distributed among nodes, such as in layers 4 and 6, rather than located in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.

15

Structural distance at the tRNA synthetase active site interface predicts pathogenicity but is captured by AlphaMissense and EVE except among score-ambiguous variants

Liebeskind, K.; Francklyn, C.; Barrantes Reynolds, R.

2026-05-26 bioinformatics 10.64898/2026.05.22.727252 medRxiv

Top 0.7%

1.1%

Variants of uncertain significance have accumulated as genomic sequencing has become more widespread, which complicates rare disease diagnosis and requires substantial resources for re-evaluation. Aminoacyl-tRNA synthetases (ARSs) are a protein family with extensive variant data and well-characterized disease associations, making them an ideal system for investigating the relationship between variant location and pathogenicity. Using structural distance measurements to the ARS-tRNA binding interface combined with existing pathogenicity predictors, AlphaMissense and EVE, we investigated whether explicit structural binding information could improve missense variant pathogenicity prediction. Pathogenic variants were found to cluster significantly closer to the tRNA-binding interface than benign variants (p = 0.0003). Incorporating explicit distance information into a Bayesian mixture model did not substantially improve predictive performance over AlphaMissense and EVE alone, suggesting that these models already implicitly capture relevant structural binding context. However, a clinically important subset of interface variants classified as ambiguous by both existing models identifies a specific gap where explicit structural distance information may provide added discriminative value, but the limited number of clinically validated variants currently available constrains the ability to fully evaluate this potential. Incorporating additional biologically relevant features not captured by existing models, such as protein stability or conformational dynamics, as well as refining structural distance calculations, may further improve classification of this subset. These findings highlight both the power and the limitations of existing pathogenicity predictors and suggest that structurally informed approaches targeting the binding interface represent a promising direction for improving classification of these ambiguous variants that have great clinical significance. 
Author Summary: Advances in clinical genetic sequencing have led to the identification of a growing number of genetic variants whose impact on human health is unknown. These "variants of uncertain significance" present a major challenge because their role in causing disease cannot yet be confirmed or ruled out. This study focuses on a specific family of essential enzymes called aminoacyl-tRNA synthetases, which play a critical role in protein translation. Mutations in these enzymes have been linked to a range of diseases. This project aims to provide a novel method for determining the pathogenicity of variants specifically in aminoacyl-tRNA synthetases. We propose that the physical proximity of a variant to the functional binding site of the enzyme is influential in determining pathogenicity. We find that this spatial relationship is a meaningful indicator of a variant's potential to disrupt normal function.

16

Characterization of ATM gene expression and evaluation of Reactive Oxygen Species in Silibinin-treated SKBR3 cells

Nademi, N. S.; Motamed, N.

2026-07-09 cancer biology 10.64898/2026.07.02.736131 medRxiv

Top 0.8%

1.0%

Show abstract

Background: Reactive oxygen species (ROS) are small, unstable and highly reactive species that can oxidize DNA. Oxidation of the purine and pyrimidine bases of DNA can lead to single- or double-strand breaks in this macromolecule. In this situation, ATM, a serine-threonine kinase, phosphorylates several target proteins, which halts the cell cycle and initiates DNA damage repair. Natural polyphenols have previously been shown to have cancer-inhibiting properties, with high efficacy and few side effects. Silibinin, the main medicinal ingredient of milk thistle (Silybum marianum), is a polyphenolic flavonolignan that has been widely regarded as an antioxidant and anticancer agent. The purpose of the present study was to investigate ATM gene expression and measure ROS in the SKBR3 cell line treated with Silibinin. Materials and Methods: The SKBR3 cell line was cultured in RPMI1640 culture medium, and an MTT assay was carried out to evaluate Silibinin cytotoxicity. Flow cytometry was used for cell cycle analysis, detection of apoptosis induction, and ROS detection. Real-time PCR was used to evaluate ATM gene expression in Silibinin-treated and untreated SKBR3 cells. Results: Silibinin at 150 µM had the most significant cytotoxic and apoptosis-inducing effect after a treatment period of 48 h. Flow cytometry data showed that Silibinin induced a considerable amount of apoptosis, caused cell cycle arrest at the G1/S phase, and induced ROS production. Real-time PCR results revealed that Silibinin increased ATM expression in the SKBR3 cell line. Conclusion: Silibinin increases ATM gene expression by inducing ROS production, which initiates cell cycle arrest and apoptosis induction in SKBR3 cells.

17

AptViralDB: A Repository of Experimentally Validated Antiviral Aptamers

Bajiya, N.; Singh, S.; Gahlot, P. S.; Raghava, G. P. S.

2026-07-11 bioinformatics 10.64898/2026.07.08.737144 medRxiv

Top 0.8%

1.0%

Show abstract

In an era of increasing drug resistance, exploring alternative molecules is crucial for the efficient management and treatment of viral diseases. Nucleic acid aptamers have emerged as highly promising candidates due to their exceptional target specificity, low immunogenicity, and versatile mechanisms for viral blocking. This manuscript describes AptViralDB, a manually curated database providing comprehensive information on experimentally validated antiviral aptamers. It contains 1,768 entries of antiviral aptamers against 40 viral species and 104 molecular targets, compiled from literature and existing databases. Each entry provides detailed annotations, including sequence, aptamer type, target, chemical modifications, binding affinity, antiviral activity, stability, and cytotoxicity. We also provide predicted secondary structures and their corresponding minimum free energy (MFE) values. Additionally, a knowledge graph created using ArcadeDB/openCypher enables users to seamlessly explore connections among aptamers, viruses, molecular targets, and biological activities. Finally, the platform offers advanced search and browsing tools, BLAST-based sequence similarity searches, GC-content analysis, downloadable datasets, and REST API access to support computational applications (https://webs.iiitd.edu.in/raghava/aptviraldb/).

18

A Multi-Epitope Vaccine Design for Human Pasteurellosis using Outer Membrane β-barrel Proteins of Pasteurella multocida

Panda, A.; Kapoor, J.; Kumar, S.; Bandyopadhyay, A.

2026-06-01 bioinformatics 10.64898/2026.05.28.728361 medRxiv

Top 0.8%

1.0%

Show abstract

Pasteurella multocida is a facultative anaerobic, Gram-negative coccobacillus that causes pasteurellosis in companion animals (cats and dogs), livestock, and poultry. Close contact with infected animals poses a significant zoonotic risk to humans through bite wounds, scratches, licking and transfer of bodily fluids. Current treatment relies mainly on antibiotics, and the lack of a licensed human vaccine further exacerbates the challenge. In the present study, a consensus-based computational approach was employed on the P. multocida Past 9 proteome. A total of 29 outer membrane β-barrel (OMBB) proteins, including TonB-dependent receptors, porins, autotransporters, adhesins and efflux pumps, were identified and used to design a multi-epitope vaccine (MEV) construct. B-cell and T-cell epitopes were predicted from the identified proteins. Ten epitopes each of cytotoxic T-lymphocyte (CTL) and helper T-lymphocyte (HTL), and three B-cell epitopes were selected based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across eight P. multocida human-infecting strains. The MEV was supplemented with suitable adjuvants at the N-terminus to enhance its immunogenicity. The MEV construct, with a length of 459 amino acids, was predicted to be antigenic, non-allergenic, non-toxic and soluble upon expression. The MEV structural model was generated and subsequently validated, which indicated good structural quality. Molecular docking between MEV and human toll-like receptor 4 (TLR4) demonstrated strong binding affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. Immune simulation of the MEV construct elicited a strong immune response. This study proposes a designed MEV candidate against human pasteurellosis and highlights OMBB proteins as potential immunogenic targets for vaccine development.

19

E-InfertilityTest: An Explainable AI Framework for Male Infertility Assessment

Das, G.; Ghosh, B.; Ghosh, Z.

2026-05-25 bioinformatics 10.64898/2026.05.21.726746 medRxiv

Top 0.8%

1.0%

Show abstract

Male infertility has emerged as a significant concern in modern society, with genetic defects among its major underlying causes. This impairment negatively impacts sperm motility and morphology, leading to conditions such as Asthenozoospermia (reduced sperm motility), Teratozoospermia (abnormal sperm morphology) and sometimes Asthenoteratozoospermia (both motility and morphology defects). Assisted reproductive technologies (ART), such as in-vitro fertilization (IVF), offer a potential solution for such cases but with a low success rate. Classical semen analysis provides only a phenotypic snapshot without revealing the fertilizing potential of the sperm. Hence, in order to screen the functional sperm population and to gain deeper insight into the reasons underlying the aberrant sperm population, it is important to study their genetic profile. In this work, we performed a meta-analysis comparing transcriptomic data of infertile sperm from Asthenozoospermia and Teratozoospermia patients with that of fertile sperm from normal individuals. We then screened a signature gene set, which was used to develop a prediction model named Explainable Infertility Test (E-InfertilityTest) to classify fertile versus infertile sperm at the preliminary level. For each prediction, the model also reports the set of genes that play a dominant role in that prediction. It thus provides a patient-specific profile of the dominant gene expression responsible for the aberration. This work warrants future validation experiments to substantiate the model's performance in a clinical setting. Users can access the E-InfertilityTest tool as a standalone version on GitHub. GitHub link: https://github.com/zglabDIB/einfertility.git

20

Deciphering the Molecular Structure of the Type III Secretion System in Chlamydia trachomatis for Structure-Based Therapeutic Targeting

Panda, A.; Kapoor, J.; Rajagopal, R.; Kumar, S.; Bandyopadhyay, A.

2026-05-09 bioinformatics 10.64898/2026.05.06.723290 medRxiv

Top 0.8%

1.0%

Show abstract

Chlamydia trachomatis is an obligate intracellular Gram-negative pathogen responsible for sexually transmitted infections and trachoma in humans. Although antibiotics are generally effective against acute infections, persistent chlamydial forms often exhibit reduced susceptibility during chronic infection. Chlamydia relies on its type III secretion system (T3SS) to inject effector proteins into host cells, making T3SS proteins attractive targets for antivirulence therapeutics. In this study, we employed an integrated computational pipeline to model and assemble the C. trachomatis T3SS constituent proteins. Template-based modeling using crystallographic structures of homologs from other Gram-negative bacteria revealed a highly conserved structural architecture despite low sequence identity (18-46%). Stereochemical validation confirmed high model quality, with most T3SS proteins exhibiting favorable protein-protein interactions (PPIs). Since the activity of the T3SS complex relies on extensive PPIs, we targeted these PPIs as a promising approach to attenuate bacterial virulence. CdsN, which functions as the ATPase of the T3SS, is a hexamer, and we targeted its dimerization interface. Structure-based virtual screening of compounds from the e-Drug3D and IMPPAT libraries against predicted hotspot residues and the identified druggable pocket at the CdsN dimeric interface, followed by ADMET screening, yielded three promising candidates: M Roflumilast (Drug ID: 1537), Elacestrant (Drug ID: 2081), and Tecovirimat (Drug ID: 1889). All three ligands formed thermodynamically stable complexes with the CdsN dimer, with Elacestrant demonstrating the most favourable binding free energy. This was also confirmed by a 100 ns molecular dynamics simulation. This study provides new insights into the molecular architecture of the C. trachomatis T3SS and identifies M Roflumilast, Elacestrant, and Tecovirimat as potential drug candidates against chlamydial infection.